Repairing Regular Expressions for Extraction
Authors
Abstract
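While synthesizing and repairing regular expressions (regexes) based on Programming-by-Examples (PBE) methods have seen rapid progress in recent years, all existing works only support synthesizing or repairing regexes for membership testing, and support for extraction is still an open problem. This paper fills the void by proposing the first PBE-based method for synthesizing and repairing regexes for extraction. Our work supports regexes that have real-world extensions such as backreferences and lookarounds. The extensions significantly affect the PBE-based synthesis and repair problem. In fact, we show that there are unsolvable instances of the problem if the synthesized regexes are not allowed to use the extensions, i.e., there is no regex without the extensions that correctly classifies the given set of examples, whereas every problem instance is solvable if the extensions are allowed. This is in stark contrast to the case for membership, where every instance is guaranteed to have a solution expressible by a pure regex without the extensions. The main contribution of the paper is an algorithm to solve the PBE-based synthesis and repair problem for extraction. Our algorithm builds on existing methods for synthesizing and repairing regexes for membership testing, i.e., the enumerative search algorithms with SMT constraint solving. However, significant extensions are needed because the SMT constraints in the previous works are based on a non-deterministic semantics of regexes. Non-deterministic semantics is sound for membership but not for extraction, because which substrings are extracted depends on the deterministic behavior of actual regex engines. To address the issue, we propose a new SMT constraint generation method that respects the deterministic behavior of regex engines. For this, we first define a novel formal semantics of an actual regex engine as a deterministic big-step operational semantics, and use it as a basis to design the new SMT constraint generation method. The key idea to simulate the determinism in the formal semantics and the constraints is to consider continuations of regex matching and use them for disambiguation. We also propose two new search space pruning techniques called approximation-by-pure-regex and approximation-by-backreferences that make use of the extraction information in the examples. We have implemented the synthesis and repair algorithm in a tool called R3 (Repairing Regex for extRaction) and evaluated it on 50 regexes that contain real-world extensions. Our evaluation shows the effectiveness of the algorithm and that our new pruning techniques substantially prune the search space.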
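The gap between membership and extraction described in the abstract can be seen with a small experiment. The sketch below is not taken from the paper: the two patterns, the helper name `extract`, and the sample inputs are illustrative assumptions, and Python's `re` module stands in for "an actual regex engine". The two patterns accept exactly the same sample strings, yet the backtracking engine's deterministic preference for the leftmost alternative makes them extract different substrings.

```python
import re

def extract(pattern: str, text: str):
    """Return the captured groups of a full match, or None if there is no match.

    Hypothetical helper for this illustration only.
    """
    m = re.fullmatch(pattern, text)
    return m.groups() if m else None

# Two illustrative patterns that differ only in the order of the alternatives.
P1 = r"(a|ab)(b?)"
P2 = r"(ab|a)(b?)"

samples = ["", "a", "b", "ab", "abb"]

# Membership view: both patterns accept the same sample strings.
accepted_by_p1 = [s for s in samples if extract(P1, s) is not None]
accepted_by_p2 = [s for s in samples if extract(P2, s) is not None]

# Extraction view: on "ab" the engine commits to the first alternative that
# lets the rest of the match succeed, so the captured substrings differ.
groups_p1 = extract(P1, "ab")
groups_p2 = extract(P2, "ab")

if __name__ == "__main__":
    print("accepted by P1:", accepted_by_p1)
    print("accepted by P2:", accepted_by_p2)
    print("P1 groups on 'ab':", groups_p1)
    print("P2 groups on 'ab':", groups_p2)
```

A constraint encoding that only asks whether some match exists cannot tell these two patterns apart, which is why an encoding for extraction has to account for the engine's deterministic choices.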
Similar resources
Repairing Data through Regular Expressions
Since regular expressions are often used to detect errors in sequences such as strings or dates, it is natural to use them for data repair. Motivated by this, we propose a data repair method based on regular expressions that makes the input sequence data obey the given regular expression with minimal revision cost. The proposed method contains two steps, sequence repair and token value repair. For s...
Repairing Regular Expressions by Adding Missing Words
Regular expressions are used in many information extraction systems like YAGO, DBpedia, Gate and SystemT. However, they sometimes do not match what their creator wanted to find. We investigate how missing words can be added automatically to a regular expression by creating disjunctions at the appropriate positions. Our demo visualizes the steps that our algorithm employs to repair the regular e...
Efficient Submatch Extraction for Practical Regular Expressions
Stuart Haber, William Horne, Pratyusa Manadhata, Miranda Mowbray, Prasad Rao. HP Laboratories, HPL-2012-41R1. Internal posting date: March 6, 2012. Keywords: regular expressions; submatch extraction; capturing groups. A capturing group is a syntax used in modern regular expression implementations to specify a subexpression of a regul...
Explanations for Regular Expressions
Regular expressions are widely used, but they are inherently hard to understand and (re)use, which is primarily due to the lack of abstraction mechanisms that causes regular expressions to grow large very quickly. The problems with understandability and usability are further compounded by the viscosity, redundancy, and terseness of the notation. As a consequence, many different regular expressi...
Regular Expressions for Provenance
As noted by Green et al. several provenance analyses can be considered a special case of the general problem of computing formal polynomials resp. power-series as solutions of an algebraic system. Specific provenance is then obtained by means of evaluating the formal polynomial under a suitable homomorphism. Recently, we presented the idea of approximating the least solution of such algebraic s...
Journal
Journal title: Proceedings of the ACM on Programming Languages
Year: 2023
ISSN: 2475-1421
DOI: https://doi.org/10.1145/3591287